We deploy the MERT-95M^{RVQ-VAE} model on a machine with 2 vCPUs and 16 GB of RAM.

The tasks include EMO, GS, MTGInstrument, MTGGenre, MTGTop50, MTGMood, NSynthI, NSynthP, VocalSetS, VocalSetT.

The video shows that the one-for-all representation from the MERT backbone can be probed on multiple music understanding tasks. The lightweight model supports streaming inference and serves different tasks simultaneously.
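The multi-task probing setup described above can be sketched as follows: a single shared embedding from the frozen backbone is fed into one lightweight linear head per task, so all tasks are answered from the same representation in one pass. The embedding dimension, class counts, and head weights below are illustrative assumptions, not the actual MERT probe configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed embedding size; MERT-95M follows a base-size transformer layout.
EMB_DIM = 768

# Illustrative output sizes for a few of the tasks listed above.
TASKS = {"EMO": 2, "GS": 24, "MTGTop50": 50, "NSynthI": 11, "VocalSetT": 17}

# One lightweight linear probe head per task over the shared representation.
heads = {name: rng.standard_normal((EMB_DIM, n_cls)) * 0.01
         for name, n_cls in TASKS.items()}

def probe_all(frame_embeddings):
    """Mean-pool frame-level embeddings, then run every probe head at once."""
    clip_vec = frame_embeddings.mean(axis=0)              # (EMB_DIM,)
    return {name: clip_vec @ W for name, W in heads.items()}

# Simulate a 10-frame embedding sequence from the frozen backbone.
logits = probe_all(rng.standard_normal((10, EMB_DIM)))
```

Because the backbone runs once per audio chunk and each head is a single matrix multiply, adding another task costs almost nothing, which is what makes simultaneous multi-task serving feasible on a 2-vCPU machine.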